College Majors

About the Data

All data is from the American Community Survey 2010-2012 Public Use Microdata Series. The data comprises 5 files, split by level of education and age.

Because my focus is on questions about graduate versus non-graduate degrees, I plan to start with the files majors-list.csv, recent-grads.csv and grad-students.csv and bring in the others when required.

majors_list description

Header Description
FOD1P Recoded field of degree - first entry
Major_code Major code, FOD1P in ACS PUMS
Major Major description

recent_grads description

Header Description
Rank Rank by median earnings
Major_code Major code, FOD1P in ACS PUMS
Major Major description
Major_category Category of major from Carnevale et al
Total Total number of people with major
Sample_size Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Men Male graduates
Women Female graduates
ShareWomen Women as share of total
Employed Number employed (ESR == 1 or 2)
Full_time Employed 35 hours or more
Part_time Employed less than 35 hours
Full_time_year_round Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
Unemployed Number unemployed (ESR == 3)
Unemployment_rate Unemployed / (Unemployed + Employed)
Median Median earnings of full-time, year-round workers
P25th 25th percentile of earnings
P75th 75th percentile of earnings
College_jobs Number with job requiring a college degree
Non_college_jobs Number with job not requiring a college degree
Low_wage_jobs Number in low-wage service jobs
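
The two derived columns in this table, ShareWomen and Unemployment_rate, can be recomputed from the raw counts as a sanity check. The sketch below uses a small made-up data frame rather than recent-grads.csv, and it assumes ShareWomen is Women / Total, which is how the description "Women as share of total" reads. The row values are hypothetical.

```r
# Toy rows with made-up counts (not taken from recent-grads.csv).
toy <- data.frame(
  Major      = c("A", "B"),
  Total      = c(1000, 400),
  Women      = c(250, 300),
  Employed   = c(900, 300),
  Unemployed = c(100, 20)
)

# ShareWomen: women as a share of everyone with the major (assumed Women / Total).
toy$ShareWomen <- toy$Women / toy$Total

# Unemployment_rate: unemployed as a share of employed plus unemployed.
toy$Unemployment_rate <- toy$Unemployed / (toy$Unemployed + toy$Employed)

print(toy)
```

Comparing recomputed values like these against the shipped columns with all.equal() is one way to catch rows where the counts and the rates disagree.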

grad_students description

Header Description
Major_code Major code, FOD1P in ACS PUMS
Major Major description
Major_category Category of major from Carnevale et al
Grad_total Total number of graduate students with major
Grad_sample_size Graduate students sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Grad_employed Number of graduate students employed (ESR == 1 or 2)
Grad_full_time_year_round Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
Grad_unemployed Number of graduate students unemployed (ESR == 3)
Grad_unemployment_rate Graduate students Unemployed / (Unemployed + Employed)
Grad_median Median earnings of Graduate full-time, year-round workers
Grad_P25th 25th percentile of graduate earnings
Grad_P75th 75th percentile of graduate earnings
Nongrad_total Total number of non-graduates with major
Nongrad_employed Number of non-graduates employed (ESR == 1 or 2)
Nongrad_full_time_year_round Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
Nongrad_unemployed Number of non-graduates unemployed (ESR == 3)
Nongrad_unemployment_rate Non-graduates Unemployed / (Unemployed + Employed)
Nongrad_median Median earnings of non-graduate full-time, year-round workers
Nongrad_P25th 25th percentile of non-graduate earnings
Nongrad_P75th 75th percentile of non-graduate earnings

Why the dataset?

I found the data interesting and wanted to understand what effect a graduate degree, and the major of that degree, has on employment and pay.

Quality Checks on data

library(tidyverse)
library(naniar)

Load data

grad_students <- read.csv("./grad-students.csv")
majors_list <- read.csv("./majors-list.csv")
recent_grads <- read.csv("./recent-grads.csv")
#convert df to tibble
grad_students <- as_tibble(grad_students)
majors_list <- as_tibble(majors_list)
recent_grads <- as_tibble(recent_grads)
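
Since the files share Major_code, one quick check after loading is whether the codes line up across files before any join. The sketch below is only an illustration on made-up stand-ins (recent_toy and grad_toy are hypothetical names, and the codes and earnings are invented); the same calls should apply to the real tibbles.

```r
# Hypothetical stand-ins for two of the loaded files; all values are made up.
recent_toy <- data.frame(Major_code = c(1100, 1101, 2100),
                         Median     = c(40000, 35000, 50000))
grad_toy   <- data.frame(Major_code  = c(1100, 2100, 2101),
                         Grad_median = c(60000, 80000, 75000))

# Codes that appear in one file but not in the other.
only_recent <- setdiff(recent_toy$Major_code, grad_toy$Major_code)
only_grad   <- setdiff(grad_toy$Major_code, recent_toy$Major_code)

# Duplicated codes would make a join multiply rows.
has_dupes <- anyDuplicated(recent_toy$Major_code) > 0 ||
  anyDuplicated(grad_toy$Major_code) > 0

# Inner join on the shared key keeps only majors found in both files.
joined <- merge(recent_toy, grad_toy, by = "Major_code")
print(joined)
```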

Check for missing values

vis_miss(grad_students)

vis_miss(majors_list)

vis_miss(recent_grads)
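
The vis_miss() plots show missingness visually; a numeric count per column can complement them. The following is a minimal base R sketch on a made-up data frame (the values are hypothetical), not output from the actual files.

```r
# Toy data with a few deliberately missing values (made up for illustration).
toy <- data.frame(
  Major = c("A", "B", "C", "D"),
  Men   = c(10, NA, 30, 40),
  Women = c(5, NA, 15, NA)
)

# Number and percentage of missing values per column.
na_count <- colSums(is.na(toy))
na_pct   <- 100 * colMeans(is.na(toy))

# Rows that contain at least one missing value.
incomplete <- toy[!complete.cases(toy), ]

print(na_count)
print(na_pct)
print(incomplete)
```

Listing the incomplete rows makes it easier to decide whether to drop them or to pass na.rm = TRUE to the plotting layers.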

Identify the outliers and spread of median salary for each major category

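#na.rm is a geom argument; passed to ggplot() it has no effect
ggplot(recent_grads, aes(x = Median, y = Major_category)) +
  geom_boxplot(width = 0.4, fill = "white", na.rm = TRUE) +
  #jitter along the category axis only, so the earnings values stay exact
  geom_jitter(aes(color = Major_category),
              width = 0, height = 0.1, size = 0.5, na.rm = TRUE) +
  labs(y = "Major Category", x = "Median income by major")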
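
To list the outliers numerically rather than read them off the plot, one option is Tukey's 1.5 x IQR rule, the usual convention behind boxplot whiskers. The sketch below assumes that rule and runs on made-up earnings; iqr_outliers is a hypothetical helper name, not something from the original analysis.

```r
# Hypothetical helper: values beyond 1.5 * IQR from the quartiles.
iqr_outliers <- function(x) {
  x <- x[!is.na(x)]
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * (q[2] - q[1])
  x[x < q[1] - fence | x > q[2] + fence]
}

# Made-up median earnings for two made-up categories.
toy <- data.frame(
  Major_category = c(rep("X", 5), rep("Y", 5)),
  Median = c(30000, 32000, 34000, 36000, 110000,
             40000, 41000, 42000, 43000, 44000)
)

# Apply the rule within each category.
outliers <- lapply(split(toy$Median, toy$Major_category), iqr_outliers)
print(outliers)
```

With the real data, the same split() and lapply() call on recent_grads$Median and recent_grads$Major_category would give one vector of flagged values per category.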

Employment and median salary based on major category

options(scipen=999)
ggplot(recent_grads, aes(x=Employed, fill=Major_category)) +
    geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity', bins=10) +
    labs(fill="")

ggplot(recent_grads, aes(x=Median, colour = Major_category)) +
  geom_freqpoly(binwidth = 10000)

Identify the outliers and spread of median salary for each major category for a graduate

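#na.rm is a geom argument; passed to ggplot() it has no effect
ggplot(grad_students, aes(x = Grad_median, y = Major_category)) +
  geom_boxplot(width = 0.4, fill = "white", na.rm = TRUE) +
  #jitter along the category axis only, so the earnings values stay exact
  geom_jitter(aes(color = Major_category),
              width = 0, height = 0.1, size = 0.5, na.rm = TRUE) +
  labs(y = "Major Category", x = "Income based on graduate major")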

Identify the outliers and spread of median salary for each major category for a non-graduate

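#na.rm is a geom argument; passed to ggplot() it has no effect
ggplot(grad_students, aes(x = Nongrad_median, y = Major_category)) +
  geom_boxplot(width = 0.4, fill = "white", na.rm = TRUE) +
  #jitter along the category axis only, so the earnings values stay exact
  geom_jitter(aes(color = Major_category),
              width = 0, height = 0.1, size = 0.5, na.rm = TRUE) +
  labs(y = "Major Category", x = "Income based on undergraduate major")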

Employment based on major category for graduate

ggplot(grad_students, aes(x=Grad_employed, fill=Major_category)) +
    geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity', bins=10)+
   labs(x ="Number of employed graduates", fill="")

ggplot(grad_students, aes(x=Grad_median, colour = Major_category)) +
  geom_freqpoly(binwidth = 10000)

Employment based on major category for non-graduate

ggplot(grad_students, aes(x=Nongrad_employed, fill=Major_category)) +
    geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity', bins=10) +
    labs(x ="Number of employed non-graduates", fill="")

ggplot(grad_students, aes(x=Nongrad_median, colour = Major_category)) +
  geom_freqpoly(binwidth = 10000)

Proposal of Questions

  1. Does pursuing a graduate degree improve job opportunities and lead to higher salaries?
  2. Men vs women: preferences among majors and the resulting job distribution
  3. Popular vs unpopular majors among those with the highest pay
  4. How much does the specific major within a major category matter, and how does it affect employment and salary?
  5. Employment and salary comparison between STEM and Arts programs
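
As a rough starting point for question 1, the graduate and non-graduate columns of grad_students can be compared row by row. The sketch below is one possible framing, not a settled method: premium_ratio and unemp_gap are hypothetical column names introduced here, and the numbers are made up.

```r
# Made-up rows shaped like grad_students (values are hypothetical).
toy <- data.frame(
  Major                     = c("A", "B", "C"),
  Grad_median               = c(80000, 60000, 90000),
  Nongrad_median            = c(50000, 48000, 60000),
  Grad_unemployment_rate    = c(0.03, 0.04, 0.02),
  Nongrad_unemployment_rate = c(0.05, 0.06, 0.05)
)

# Earnings ratio: graduate median relative to non-graduate median.
toy$premium_ratio <- toy$Grad_median / toy$Nongrad_median

# Unemployment gap: positive when non-graduates are unemployed more often.
toy$unemp_gap <- toy$Nongrad_unemployment_rate - toy$Grad_unemployment_rate

# Majors ordered from largest to smallest earnings ratio.
ranked <- toy[order(-toy$premium_ratio), ]
print(ranked)
```

A ratio is used instead of a raw difference so that majors with very different pay levels stay comparable; a difference in dollars would be an equally reasonable choice.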

Initial Visualizations

ggplot(recent_grads, aes(x = Major_category, y= Median, fill = Men)) + 
  geom_bar(stat = "identity", position = "dodge") + theme(axis.text.x = element_text(angle = 60, hjust = 1))

ggplot(recent_grads, aes(x = Major_category, y= Median, fill = Women)) + 
  geom_bar(stat = "identity", position = "dodge") + theme(axis.text.x = element_text(angle = 60, hjust = 1))

#Number of men and women in each Major category
ggplot(recent_grads, aes(Men , Women)) + 
  geom_point() + 
  stat_smooth() +
  facet_wrap(~Major_category)
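
For question 2, the scatter plot can be paired with a per-category share of women computed from summed counts. This is a small sketch on made-up data; it assumes the share is taken over Men + Women within each category, which may differ slightly from the ShareWomen column.

```r
# Made-up counts for two made-up categories.
toy <- data.frame(
  Major_category = c("X", "X", "Y"),
  Men            = c(100, 300, 50),
  Women          = c(300, 100, 150)
)

# Sum men and women within each category (rows with NA are dropped by default).
by_cat <- aggregate(cbind(Men, Women) ~ Major_category, data = toy, FUN = sum)

# Share of women among all graduates in the category.
by_cat$ShareWomen <- by_cat$Women / (by_cat$Men + by_cat$Women)
print(by_cat)
```

Summing counts before dividing weights each major by its size, whereas averaging the per-major shares would treat small and large majors equally.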

#Grad and Nongrad median salary across all major categories
ggplot(grad_students, aes(Grad_median , Nongrad_median)) + 
  geom_point() + 
  stat_smooth() +
  facet_wrap(~Major_category)

Median Income of Each major in each Major category

major_categories_lst <- unique(majors_list$Major_Category)
for (major_cat in major_categories_lst){
  if (!is.na(major_cat)){
    filtered_data <- filter(recent_grads, Major_category == major_cat)
    print(ggplot(filtered_data, aes(x=Median, fill=Major)) +
    geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity', bins=20) +
    labs(x="Median Salary", fill="", title= major_cat))
  }
}

for (major_cat in major_categories_lst){
  if (!is.na(major_cat)){
    filtered_data <- filter(grad_students, Major_category == major_cat)
    print(ggplot(filtered_data, aes(x=Grad_median, fill=Major)) +
    geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity', bins=20) +
    labs(x="Median Salary", fill="", title= major_cat))
  }
}